Split File by Point (Text Processing)

Synopsis

Segments documents by defining the splitting point.

Description

This operator extracts segments from a set of text documents in a directory by splitting each document into parts. The split points are described by a regular expression.
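The splitting idea can be sketched in a few lines of Python. This is only an illustration of splitting at a regular-expression match, not the operator's implementation: the function name `split_by_point` is hypothetical, dropping empty segments is an assumption, and Python's `re` syntax may differ in details from the regular-expression dialect the operator uses.

```python
import re


def split_by_point(text: str, split_expression: str) -> list[str]:
    """Split text at every match of split_expression.

    Assumption: empty segments (e.g. between two consecutive
    split points) are dropped; the operator may behave differently.
    """
    segments = re.split(split_expression, text)
    return [segment for segment in segments if segment]


if __name__ == "__main__":
    document = "first line\nsecond line\n\nthird line"
    # Split on every line break.
    for part in split_by_point(document, r"\n"):
        print(repr(part))
```

Note that in Python, a split expression containing capturing groups makes `re.split` return the captured text as well, so a non-capturing group `(?:...)` is the safer choice in this sketch.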

Input

  • through (File)

    The through port.

Output

  • through (File)

    The through port.

Parameters

  • preview

    Shows a preview of the results that the current configuration would produce.

  • texts

    A directory containing the documents to be segmented.

  • output_directory

    The directory to which the segments are written.

  • split_expression

    Specifies the split points in the documents using a regular expression. For example, an expression that matches a line break splits the document at every line break.

  • use_file_extension_as_type

    If checked, the type of each file is determined by its extension. Files with unknown extensions are treated as text files.

  • content_type

    The content type of the input texts.

  • encoding

    The encoding used for reading and writing files.
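How the directory, expression, and encoding parameters might fit together can be sketched as follows. This is a rough Python analogue under stated assumptions, not the operator's actual behavior: the function `split_directory`, the `<name>_<index><extension>` output naming, the default `utf-8` encoding, and the skipping of empty segments are all hypothetical choices made for illustration.

```python
import re
from pathlib import Path


def split_directory(texts, output_directory, split_expression, encoding="utf-8"):
    """Split every file in `texts` and write the segments to `output_directory`.

    Returns the list of segment files that were written.
    """
    out = Path(output_directory)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(texts).iterdir()):
        if not source.is_file():
            continue  # sub-directories are ignored in this sketch
        content = source.read_text(encoding=encoding)
        # Assumption: empty segments are not written out.
        segments = [s for s in re.split(split_expression, content) if s]
        for index, segment in enumerate(segments):
            # Hypothetical naming scheme: report.txt -> report_0.txt, report_1.txt, ...
            target = out / f"{source.stem}_{index}{source.suffix}"
            target.write_text(segment, encoding=encoding)
            written.append(target)
    return written
```

A call such as `split_directory("docs", "segments", r"\n")` would then write one file per non-empty line of each input document, reading and writing with the same encoding.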